Context
Why is this problem important to solve?
Climate change's threat looms ever nearer for the entire globe. According to NASA, climate change will bring flooding, drought, heat waves, stronger hurricanes, and higher sea levels to the US. While some climate variation occurs naturally, humanity's CO2 emissions have already caused changes that would not have happened otherwise. Susan Solomon et al. state that the concentration of CO2 in our atmosphere is already so large that its effects will be irreversible for 1,000 years even if all emissions were to stop today. To continue enjoying this beautiful planet in the long run, humanity must manage its resources more efficiently.
Objective
What is the intended goal?
The US Environmental Protection Agency estimates that approximately 25% of all greenhouse gas emissions (including CO2) come from electricity production. The same EPA article states that the US generates 62% of its electricity by burning fossil fuels such as coal and natural gas. Predicting future CO2 emissions from different energy sources may help policy makers and innovators understand where to focus their green energy efforts.
Key questions
What are the key questions that need to be answered?
Problem Formulation
What is it that we are trying to solve using data science?
The purpose of this project is to identify potential opportunities to prevent the acceleration of CO2 emission rates from electricity generation.
This dataset contains past monthly carbon dioxide emissions from electricity generation in the US, published by the US Energy Information Administration and categorized by fuel type, such as coal, natural gas, etc.
Before we visualize and model our data, we must make sure there are no null or otherwise invalid values and format the data into a dataframe.
# Set up Google Colab.
from google.colab import drive
drive.mount('/content/drive')
# Avoid scroll-in-the-scroll for large outputs.
from IPython.display import Javascript
def resize_colab_cell():
    display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 5000})'))
get_ipython().events.register('pre_run_cell', resize_colab_cell)
# Import basic libraries.
!pip install statsmodels --upgrade
import pandas as pd
import warnings
import itertools
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from statsmodels.graphics import tsaplots
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.tsa.stattools import coint, adfuller
from statsmodels.tsa.ar_model import AutoReg
warnings.filterwarnings("ignore")
# Load the data.
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/MER_T12_06.xlsx')
# Convert dates into standard datetime
df['YYYYMM'] = pd.to_datetime(df['YYYYMM'], errors = 'coerce', format = '%Y%m')
# Drop rows with NaT in date column and set date as index.
df = df.dropna(subset = ['YYYYMM']).set_index('YYYYMM')
df.head(15)
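The coerce-then-drop pattern above is what filters out non-monthly rows. As a minimal, self-contained sketch on synthetic rows (the month-"13" annual-total convention is an assumption about how the EIA file encodes yearly sums):

```python
import pandas as pd

# Synthetic rows shaped like the EIA 'YYYYMM' column: two monthly rows and
# one annual-total row encoded as month '13', which is not a real date.
raw = pd.DataFrame({'YYYYMM': ['197301', '197302', '197313'],
                    'Value': [10.0, 11.0, 121.0]})

# errors='coerce' turns the unparseable annual-total entry into NaT instead
# of raising, so the subsequent dropna removes exactly those rows.
raw['YYYYMM'] = pd.to_datetime(raw['YYYYMM'], format='%Y%m', errors='coerce')
clean = raw.dropna(subset=['YYYYMM']).set_index('YYYYMM')
print(clean)
```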
# Check the info of the dataframe.
df.info()
Observations:
# Convert valid values in the Value column to float and drop any rows with invalid values in said column.
# Check the info of the dataframe.
df['Value'] = pd.to_numeric(df['Value'], errors = 'coerce', downcast = 'float')
df = df.dropna(subset = ['Value'])
df.head(15)
Observations:
Since the dataset has 8 different sources of carbon emissions plus a total, we first need to group by the description. Then we will be able to compare the sources to each other in our visualizations. Once we do so, we can drop the MSN column from the dataset.
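One caveat: the grouping loop in the next cell pairs `grouped.groups` with a hand-ordered `EnergyTypes` list, which implicitly assumes the alphabetical order of the Description strings matches that list. A less order-sensitive alternative, sketched here on synthetic data, is to pivot the long table so columns are named directly from the data:

```python
import pandas as pd

# Synthetic long-format frame shaped like the EIA data:
# one row per (date, fuel type) pair.
idx = pd.to_datetime(['1989-01-01', '1989-01-01', '1989-02-01', '1989-02-01'])
long_df = pd.DataFrame({'Description': ['Coal', 'Natural Gas'] * 2,
                        'Value': [150.0, 60.0, 140.0, 65.0]},
                       index=idx)

# pivot keys each column to its Description string directly,
# so column names cannot drift out of sync with the groups.
wide = long_df.pivot(columns='Description', values='Value')
print(wide)
```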
# Create a dataframe that groups by description and uses the description as the column names.
EnergyTypes = ['Coal', 'Distillate Fuel', 'Geothermal Energy', 'Natural Gas', 'Non-Biomass Waste',
'Petroleum Coke', 'Petroleum', 'Residual Fuel Oil', 'Total Emissions']
Emissions = pd.DataFrame(index = df.index.unique(), columns = EnergyTypes)
grouped = df.groupby('Description')
for g, c in zip(grouped.groups, EnergyTypes):
    Emissions[c] = grouped.get_group(g).iloc[:, 1]
Emissions.head(15)
Observations:
# Drop rows with null values in Emissions.
EmissionsDropped = Emissions.dropna()
# View information on our new dataframe.
EmissionsDropped.info()
Observations:
# View the head of our data.
EmissionsDropped.head(15)
Observations:
Now that the dataframe is in a more user-friendly format, we are ready to create the different plots and graphs.
# Create boxplots to visualize each energy source's emissions over this timeframe.
EmissionsDropped.boxplot(figsize=(24, 12), grid = False, fontsize=12, color = 'orange');
Observations:
# Compare total emissions from each energy source in a bar plot.
EmissionsDropped.sum().plot.bar(figsize=(24, 12), fontsize=12, ylabel = 'CO2 Emissions (MMT)', xlabel = 'Energy Type', title = '1989 - 2016 CO2 Emissions by Energy Type');
Observations:
# Show the relationship between power generation carbon emissions and time in a line graph.
EmissionsDropped.plot(subplots = True, figsize = (24,24), xlabel = 'Year', ylabel = 'Carbon Emissions (MMT)');
Observations:
# Now we will view the rolling statistics for all of the energy sources.
# Calculating the rolling mean and standard deviation for a window of 10 observations.
for c in EnergyTypes:
    rolmean = EmissionsDropped[c].rolling(window=10).mean()
    rolstd = EmissionsDropped[c].rolling(window=10).std()
    # Visualizing the rolling mean and standard deviation.
    plt.figure(figsize=(16,8))
    actual = plt.plot(EmissionsDropped[c], color='grey', label=c)
    rollingmean = plt.plot(rolmean, color='blue', label='Rolling Mean')
    rollingstd = plt.plot(rolstd, color='purple', label='Rolling Std. Dev.')
    plt.title('Rolling Mean & Standard Deviation')
    plt.legend()
    plt.show()
Observations:
For this project, we are only interested in analyzing natural gas CO2 emissions. Therefore, we need a dataframe which contains only the date as an index and the natural gas observations. When we went to visualize the data, we had to drop all of the observations measured before 1989, since some of the energy sources had no data until then. However, natural gas has no missing values from the start of the original dataset, so we will use it to build our forecasting dataframe. Luckily, we still have observations from the entire date range in our 'Emissions' dataframe!
# Create a new, natural gas-only dataframe based on the 'Emissions' dataframe.
NatGas = pd.DataFrame()
NatGas['Value'] = Emissions['Natural Gas']
NatGas.head(15)
# View the dataframe's information.
NatGas.info()
Observations:
# Splitting the dataset into 80% train and 20% test and visualizing them on the same plot.
train, test = train_test_split(NatGas, test_size=.20, train_size=.80, random_state=10, shuffle=False);
fig, ax = plt.subplots(figsize = (24, 8));
train.plot(ax = ax, color = 'grey');
test.plot(ax = ax, color = 'purple');
plt.legend(['Train Data', 'Test Data']);
Next, we will implement the Augmented Dickey-Fuller (ADF) test to check for stationarity. The test has the following conditions:
- Null hypothesis: the series has a unit root (i.e., it is non-stationary).
- Alternative hypothesis: the series is stationary.
- If the p-value is less than our alpha of .05, we reject the null hypothesis and conclude the series is stationary.
# We are going to visualize multiple plots of rolling means and standard deviations against the data.
# We will also use the Augmented Dickey-Fuller test more than once. To make this easier, we're going to make a function.
def MakePlot(x):
    # Visualize the rolling mean and standard deviation with window = 12.
    plt.figure(figsize = (24, 8))
    plt.plot(x, color = 'gray', label = 'Values')
    plt.plot(x.rolling(window = 12).mean(), color = 'blue', label = 'Mean', linestyle = '--')
    plt.plot(x.rolling(window = 12).std(), color = 'purple', label = 'Std. Dev.', linestyle = ':')
    plt.legend()
    plt.show()
    # Implement the ADF test on the data.
    print(adfuller(x['Value']))
# Visualizing the rolling mean and standard deviation and performing the ADF test on the training data.
MakePlot(train)
Observations:
train.info()
train.head()
# Using seasonal_decompose to create a dataframe with the individual components of the decomposed time series.
decomposition = sm.tsa.seasonal_decompose(train.Value)
DecompData = pd.DataFrame()
DecompData['trend'] = decomposition.trend
DecompData['seasonal'] = decomposition.seasonal
DecompData['noise'] = decomposition.resid
# Visualize the components.
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize = (24,24));
DecompData['trend'].plot(ax=ax1, title='Trend', color='orange');
DecompData['seasonal'].plot(ax=ax2, title='Seasonal', color='purple');
DecompData['noise'].plot(ax=ax3, title='Noise');
Observations:
# We will use logarithmic transformation to stabilize the standard deviation and then test again for stationarity using ADF.
# Perform the logarithmic transformation.
LogTrain = np.log(train)
print(LogTrain)
# Visualizing the rolling mean and standard deviation and performing the ADF test on the transformed training data.
MakePlot(LogTrain)
Observations:
While the standard deviation has stabilized, there is still strong seasonality, an upward trend, and a p-value of .97. It is now time to shift the series by one month and apply differencing using the lagged series.
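One reason for differencing the log series in particular: the first difference of a logarithm approximates the period-over-period growth rate, which is why it tames both the trend and a variance that grows with the level. A quick numeric check on a toy series:

```python
import numpy as np
import pandas as pd

# A short series growing ~5% per period.
s = pd.Series([100.0, 105.0, 110.25, 115.7625])

log_diff = np.log(s).diff()
pct = s.pct_change()

# log(x_t) - log(x_{t-1}) = log(1 + growth), which is close to the growth
# rate itself for small growth rates.
print(pd.DataFrame({'log_diff': log_diff, 'pct_change': pct}))
```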
# Calculate the first order difference.
Diff = LogTrain.diff(periods = 1).dropna()
# Visualizing the rolling mean and standard deviation and performing the ADF test on the differenced training data.
MakePlot(Diff)
Observations:
The graphs of the rolling mean and standard deviation are both stabilized, and our p-value is only .001. Since this is less than our alpha of .05, we reject the null hypothesis and conclude the series is now stationary. We are ready to begin modeling the data.
# Using seasonal_decompose to create a dataframe with the individual components of the decomposed time series.
decomposition = sm.tsa.seasonal_decompose(Diff.Value)
DecompData = pd.DataFrame()
DecompData['trend'] = decomposition.trend
DecompData['seasonal'] = decomposition.seasonal
DecompData['noise'] = decomposition.resid
# Visualize the components.
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize = (48,24));
DecompData['trend'].plot(ax=ax1, title='Trend', color='orange');
DecompData['seasonal'].plot(ax=ax2, title='Seasonal', color='purple');
DecompData['noise'].plot(ax=ax3, title='Noise');
Observations:
There still appears to be some strong seasonality in the data, but our ADF p-value is already less than .05. Depending on how the AR, MA, ARMA, and ARIMA models perform, we may have to test other model types.
# creating two subplots to show ACF and PACF plots
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(20, 12))
# creating and plotting the ACF charts to lag=12
tsaplots.plot_acf(Diff, zero=False, ax=ax1, lags = 12)
# creating and plotting the PACF chart to lag=12
tsaplots.plot_pacf(Diff, zero=False, ax=ax2, lags = 12)
plt.show()
Observations:
#Apply AR model
plt.figure(figsize=(16,8))
model_AR = AutoReg(Diff, lags=1) #Use number of lags as 1 and apply AutoReg function on Diff series
results_AR = model_AR.fit() #fit the model
plt.plot(Diff)
predict = results_AR.predict(start=0,end=len(Diff)-1) #predict the series
predict = predict.fillna(0) #Converting NaN values to 0
plt.plot(predict, color='red')
plt.title('AR Model - RMSE: %.4f'% mean_squared_error(predict,Diff['Value'], squared=False)) #Calculating rmse
plt.show()
results_AR.aic
Observations:
So far we are off to a solid start with a low RMSE of 0.1539 and an AIC of about -369.51.
#Apply MA model.
plt.figure(figsize=(16,8))
model_MA = ARIMA(Diff, order=(0, 0, 1)) #Using p=0, d=0, q=1 and apply ARIMA function on Diff series
results_MA = model_MA.fit() #fit the model
plt.plot(Diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('MA Model - RMSE: %.4f'% mean_squared_error(results_MA.fittedvalues,Diff['Value'], squared=False))
plt.show()
results_MA.aic
Observations:
The MA model has both a higher RMSE and a higher AIC than the AR model.
# Apply the ARMA model.
plt.figure(figsize=(16,8))
model_ARMA = ARIMA(Diff, order=(1, 0, 1)) #Using p=1, d=0, q=1 and apply ARIMA function on Diff series
results_ARMA = model_ARMA.fit() #fit the model
plt.plot(Diff)
plt.plot(results_ARMA.fittedvalues, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues,Diff['Value'], squared=False))
plt.show()
results_ARMA.aic
Observations:
This model performed slightly worse than the AR model with an AIC of -369.32.
#Apply the ARIMA model.
plt.figure(figsize=(16,8))
model_ARIMA = ARIMA(Diff, order=(1, 1, 1))
results_ARIMA = model_ARIMA.fit() #fit the model
plt.plot(Diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues,Diff['Value'], squared=False))
plt.show()
results_ARIMA.aic
Observations:
ARIMA performed the worst of the four, with a slightly higher RMSE and the highest AIC of -364.21. We will move forward with the AR model, as it has the best RMSE and AIC.
# Get the fitted values.
predictions = pd.Series(results_AR.fittedvalues)
# First step - take the cumulative sum of the fitted values.
predictions_cumsum = predictions.cumsum()
# Second step - add a base value of the log series to the cumulative sum values.
predictions_log = pd.Series(LogTrain['Value'].iloc[13], index=LogTrain.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)
#Third step - applying exponential transformation
predictions_AR = np.exp(predictions_log) #use exponential function
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(train, color = 'c', label = 'Original Series') #plot the original train series
plt.plot(predictions_AR, color = 'r', label = 'Predicted Series') #plot the predictions_AR series
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
Observations:
While this model is doing a decent job of capturing the overall trend of the data, it is not accurately reflecting its seasonality. Let's see how it performs against the test data.
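As a sanity check on the inverse transformation used above (cumulative sum of the differences, add the base log value, then exponentiate), a toy series can be reconstructed exactly:

```python
import numpy as np
import pandas as pd

# Toy series to verify the inverse transform: cumulative-sum the log
# differences, add the first log value back, then exponentiate.
s = pd.Series([100.0, 105.0, 99.0, 120.0])
log_s = np.log(s)
diffs = log_s.diff().dropna()

recovered = np.exp(log_s.iloc[0] + diffs.cumsum())
# 'recovered' matches the original series from the second observation onward.
print(recovered)
```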
#Forecasting the values for the test data with the chosen AR model.
forecasted_AR = pd.DataFrame()
forecasted_AR['forecasted'] = results_AR.forecast(steps=105)
#First step - take the cumulative sum of the forecasts.
forecasted_cumsum = forecasted_AR.cumsum()
#Second step - add the last value of the log training series to the cumulative sum values.
forecasted_log = pd.Series(LogTrain['Value'].iloc[417], index=forecasted_cumsum.index)
forecasted_log = forecasted_log + forecasted_cumsum['forecasted']
#Third step - apply the exponential transformation to the forecasted log values.
forecasted_AR = np.exp(forecasted_log)
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(NatGas, color = 'c', label = 'Original Series')
plt.plot(predictions_AR, color = 'r', label = 'Prediction on Train data') #plot the predictions_AR series
plt.plot(forecasted_AR, label = 'Prediction on Test data', color='b') #plot the forecasted_AR series
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
print(mean_squared_error(predictions_AR, train, squared = False)) #calculate the RMSE on the train data
print(mean_squared_error(forecasted_AR, test, squared = False)) #calculate the RMSE on the test data
Observations:
The graph clearly demonstrates that our model failed to capture the seasonality of the data. This is confirmed by the fact that the forecasted RMSE is nearly double that of the training RMSE. Our model is overfitting.
We are going to brute-force test all combinations of the p, d, q, P, D, and Q parameters, with each taking a value of 0 or 1. While there may be better parameters out there, a wider search has to be weighed against the time and resource cost of performing it.
# Create a list of all possible parameter combinations.
p = d = q = range(0, 2)
Vals = list(itertools.product(p, d, q))
SarimaValues = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
# Perform logarithmic transformation on the original data.
Log = np.log(NatGas)
SarimaAics = []
for Param in Vals:
    for SeasonalParam in SarimaValues:
        try:
            model_SARIMA = sm.tsa.statespace.SARIMAX(Log, order = Param, seasonal_order = SeasonalParam, enforce_stationarity=False, enforce_invertibility=False)
            SarimaResults = model_SARIMA.fit()
            print('SARIMA{}x{} - AIC:{}'.format(Param, SeasonalParam, SarimaResults.aic))
            if SarimaResults.mle_retvals is not None and SarimaResults.mle_retvals['converged'] == False:
                print(SarimaResults.mle_retvals)
            SarimaAics.append(SarimaResults.aic)
        except:
            continue
SarimaAics.sort()
print(SarimaAics[0])
Observations:
Based purely on AIC, our best parameter combination is (1, 1, 1) x (1, 0, 1, 12) with an AIC of -1132.67. Before moving forward, though, we need to check that the model's residuals are uncorrelated and normally distributed with zero mean.
model_SARIMA = sm.tsa.statespace.SARIMAX(Log, order = (1, 1, 1), seasonal_order = (1, 0, 1, 12))
SarimaResults = model_SARIMA.fit()
print(SarimaResults.summary())
Observations:
SarimaResults.plot_diagnostics(figsize=(15, 12))
plt.show()
Observations:
Since we are going to test both the one-step-ahead forecast and the dynamic forecast as well as predict beyond the dates in our dataset, there are some processes we will perform repeatedly. To be more efficient, we will define functions to call later.
# This function gets the predictions' confidence intervals, exponentiates them, plots them against the actual values, and prints the MSE and RMSE.
def PredResults(pred, errors = True):
    # Get the confidence intervals of the fitted predictions.
    predictions_SARIMAX = np.exp(pred.conf_int())
    # Plot the predictions against the actual values.
    ax = NatGas['1973':].plot(label='observed', figsize=(24, 8))
    np.exp(pred.predicted_mean).plot(label='Forecast', ax=ax, color = 'r')
    ax.fill_between(predictions_SARIMAX.index,
                    predictions_SARIMAX.iloc[:, 0],
                    predictions_SARIMAX.iloc[:, 1], color='y', alpha=.5)
    ax.set_xlabel('Time (years)')
    ax.set_ylabel('NG CO2 Emissions')
    plt.legend()
    plt.show()
    # Print the MSE and RMSE for the predictions. For an out-of-sample forecast, check the p-values instead.
    if errors == True:
        NatGas_forecast = np.exp(pred.predicted_mean)
        NatGas_truth = NatGas.Value.iloc[418:523]
        mse = ((NatGas_forecast - NatGas_truth) ** 2).mean()
        print('The Mean Squared Error (MSE) of the forecast is {}'.format(round(mse, 2)))
        print('The Root Mean Square Error (RMSE) of the forecast: {:.4f}'
              .format(np.sqrt(sum((NatGas_forecast - NatGas_truth) ** 2) / len(NatGas_forecast))))
    else:
        print(SarimaResults.pvalues)
    return(pred, predictions_SARIMAX)
# Get predictions with a one-step-ahead forecast.
OneStep = PredResults(SarimaResults.get_prediction(start = 418, end = 522, dynamic=False))
Observations:
While the RMSE is higher for the SARIMA model than it was for the AR model, we can clearly see from the graph that the SARIMA model did a much better job capturing the magnitude of the data's seasonality. Maybe a dynamic model will perform better.
# Get predictions with a dynamic forecast.
Dynamic = PredResults(SarimaResults.get_prediction(start = 418, end = 522, dynamic=True))
Observations:
The dynamic SARIMA's MSE and RMSE are 4 times and 2 times greater than those of the one-step-ahead forecast, respectively. We will therefore use the one-step-ahead forecast to predict the next 2 years.
# Forecast the next two years with a one-step-ahead forecast.
Forecast = PredResults(SarimaResults.get_forecast(steps = 24, dynamic=False), False)
Final Solution Design:
Refined Insights: